Multi-GPU Graph Analytics
We present a single-node, multi-GPU programmable graph processing library
that allows programmers to easily extend single-GPU graph algorithms to achieve
scalable performance on large graphs with billions of edges. Directly using the
single-GPU implementations, our design only requires programmers to specify a
few algorithm-dependent concerns, hiding most multi-GPU related implementation
details. We analyze the theoretical and practical limits to scalability in the
context of varying graph primitives and datasets. We describe several
optimizations, such as direction-optimizing traversal and a just-enough memory
allocation scheme, for better performance and smaller memory consumption.
Compared to previous work, we achieve best-of-class performance across
operations and datasets, including excellent strong and weak scalability on
most primitives as we increase the number of GPUs in the system.
Comment: 12 pages. Final version submitted to IPDPS 201
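The direction-optimizing traversal named in the abstract above can be illustrated with a small single-threaded sketch. This is not the library's implementation: the function name, the adjacency-dict input, and the `alpha` switching threshold are assumptions made for illustration. The idea is to expand the frontier outward ("push") while it is small and to let unvisited vertices search for a parent in the frontier ("pull") once the frontier covers a large share of the remaining edges.

```python
def direction_optimizing_bfs(adj, source, alpha=4):
    """BFS that switches between push (top-down) and pull (bottom-up) steps.

    adj:   dict mapping every vertex to a list of neighbours (undirected graph).
    alpha: hypothetical tuning knob; pull when frontier_edges * alpha exceeds
           the number of edges incident to unvisited vertices.
    """
    depth = {source: 0}
    frontier = {source}
    level = 0
    while frontier:
        level += 1
        frontier_edges = sum(len(adj[v]) for v in frontier)
        unvisited = [v for v in adj if v not in depth]
        unvisited_edges = sum(len(adj[v]) for v in unvisited)
        next_frontier = set()
        if frontier_edges * alpha > unvisited_edges:
            # Pull: each unvisited vertex checks whether any neighbour is in the frontier.
            for v in unvisited:
                if any(u in frontier for u in adj[v]):
                    depth[v] = level
                    next_frontier.add(v)
        else:
            # Push: each frontier vertex expands its unvisited neighbours.
            for u in frontier:
                for v in adj[u]:
                    if v not in depth:
                        depth[v] = level
                        next_frontier.add(v)
        frontier = next_frontier
    return depth
```

Both directions compute the same depths; only the amount of work per step differs, which is why the switch can be driven by a heuristic without affecting correctness.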
Gunrock: GPU Graph Analytics
For large-scale graph analytics on the GPU, the irregularity of data access
and control flow, and the complexity of programming GPUs, have presented two
significant challenges to developing a programmable high-performance graph
library. "Gunrock", our graph-processing system designed specifically for the
GPU, uses a high-level, bulk-synchronous, data-centric abstraction focused on
operations on a vertex or edge frontier. Gunrock achieves a balance between
performance and expressiveness by coupling high performance GPU computing
primitives and optimization strategies with a high-level programming model that
allows programmers to quickly develop new graph primitives with small code size
and minimal GPU programming knowledge. We characterize the performance of
various optimization strategies and evaluate Gunrock's overall performance on
different GPU architectures on a wide range of graph primitives that span from
traversal-based algorithms and ranking algorithms, to triangle counting and
bipartite-graph-based algorithms. The results show that on a single GPU,
Gunrock has on average at least an order of magnitude speedup over Boost and
PowerGraph, comparable performance to the fastest GPU hardwired primitives and
CPU shared-memory graph libraries such as Ligra and Galois, and better
performance than any other GPU high-level graph library.
Comment: 52 pages, invited paper to ACM Transactions on Parallel Computing
(TOPC), an extended version of PPoPP'16 paper "Gunrock: A High-Performance
Graph Processing Library on the GPU
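As a rough illustration of a bulk-synchronous, frontier-centric abstraction like the one described above, the sketch below expresses breadth-first search as repeated operations on a vertex frontier. It is plain sequential Python, not Gunrock code; the helper names `advance` and `filter_unvisited` and the list-of-lists graph layout are assumptions chosen for this example.

```python
def advance(adj, frontier):
    """Visit the neighbour list of every frontier vertex, producing (src, dst) pairs."""
    return [(u, v) for u in frontier for v in adj[u]]


def filter_unvisited(edges, labels, level):
    """Keep destinations that are not yet labelled, label them, and return them."""
    out = []
    for _, v in edges:
        if labels[v] is None:
            labels[v] = level
            out.append(v)
    return out


def bfs(adj, source):
    """Label each vertex with its BFS depth; unreachable vertices stay None."""
    labels = [None] * len(adj)
    labels[source] = 0
    frontier = [source]
    level = 0
    while frontier:  # one bulk-synchronous step per iteration
        level += 1
        frontier = filter_unvisited(advance(adj, frontier), labels, level)
    return labels
```

On a GPU, each of the two steps would presumably run as a data-parallel operation over the whole frontier, with a global synchronization between steps; the loop structure stays the same.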
Performance Characterization of High-Level Programming Models for GPU Graph Analytics
We identify several factors that are critical to high-performance GPU graph analytics: efficient building block operators, synchronization and data movement, workload distribution and load balancing, and memory access patterns. We analyze the impact of these critical factors through three GPU graph analytics frameworks: Gunrock, MapGraph, and VertexAPI2. We also examine their effect on different workloads: four common graph primitives from multiple graph application domains, evaluated through real-world and synthetic graphs. We show that efficient building block operators enable more powerful operations for fast information propagation and result in fewer device kernel invocations, less data movement, and fewer global synchronizations, and thus are key focus areas for efficient large-scale graph analytics on the GPU.
Distributed Equivalent Substitution Training for Large-Scale Recommender Systems
We present Distributed Equivalent Substitution (DES) training, a novel
distributed training framework for large-scale recommender systems with dynamic
sparse features. DES introduces fully synchronous training to large-scale
recommender systems for the first time by reducing communication, thus making
the training of commercial recommender systems converge faster and reach better
CTR. DES requires much less communication by substituting the weights-rich
operators with the computationally equivalent sub-operators and aggregating
partial results instead of transmitting the huge sparse weights directly
through the network. Due to the use of synchronous training on large-scale Deep
Learning Recommendation Models (DLRMs), DES achieves higher AUC (Area Under the
ROC Curve). We successfully apply DES training to multiple popular DLRMs in
industrial scenarios. Experiments show that our implementation outperforms the
state-of-the-art PS-based training framework, achieving up to 68.7%
communication savings and higher throughput compared to other PS-based
recommender systems.
Comment: Accepted by SIGIR '2020. Proceedings of the 43rd International ACM
SIGIR Conference on Research and Development in Information Retrieval. 202
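The idea of aggregating partial results instead of transmitting sparse weights can be illustrated with the simplest weights-rich operator, a dot product between a weight table and a sparse feature vector. The sketch below is not the DES implementation; the function names and the modulo sharding rule are assumptions. Each simulated worker keeps its own shard of the weights and contributes one partial sum, and the sum of the partials equals the result computed with all weights in one place.

```python
def full_dot(weights, features):
    """Reference result: all weights live in one place."""
    return sum(weights.get(i, 0) * x for i, x in features.items())


def shard_weights(weights, num_workers):
    """Assign each weight to a worker by feature id (hypothetical sharding rule)."""
    shards = [dict() for _ in range(num_workers)]
    for i, w in weights.items():
        shards[i % num_workers][i] = w
    return shards


def partial_dot(shard, features):
    """Sub-operator: a worker multiplies only the weights it owns."""
    return sum(w * features[i] for i, w in shard.items() if i in features)


def distributed_dot(shards, features):
    """Aggregate one partial result per worker; weights never leave their shard."""
    partials = [partial_dot(shard, features) for shard in shards]
    return sum(partials), len(partials)
```

In this toy setting, the quantity exchanged is one number per worker, regardless of how many weights each shard holds. The example uses integer weights so the two results match exactly; with floating-point values the summation order could introduce small rounding differences.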
Exploring the Design Space of Static and Incremental Graph Connectivity Algorithms on GPUs
Connected components and spanning forest are fundamental graph algorithms due
to their use in many important applications, such as graph clustering and image
segmentation. GPUs are an ideal platform for graph algorithms due to their high
peak performance and memory bandwidth. While there exist several GPU
connectivity algorithms in the literature, many design choices have not yet
been explored. In this paper, we explore various design choices in GPU
connectivity algorithms, including sampling, linking, and tree compression, for
both the static and the incremental settings. Our various design choices
lead to over 300 new GPU implementations of connectivity, many of which
outperform the state of the art. We present an experimental evaluation and show
that we achieve an average speedup of 2.47x over existing static
algorithms. In the incremental setting, we achieve a throughput of up to 48.23
billion edges per second. Compared to state-of-the-art CPU implementations on a
72-core machine, we achieve a speedup of 8.26--14.51x for static connectivity
and 1.85--13.36x for incremental connectivity using a Tesla V100 GPU.
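Two of the design dimensions named above, linking and tree compression, can be sketched with a sequential union-find structure. This is only one point in the design space and not any of the paper's GPU implementations: the class name, the rule of linking the larger root under the smaller one, and the use of path halving are assumptions for illustration, and sampling is omitted. Because edges are processed one at a time, the same structure also serves the incremental setting.

```python
class Connectivity:
    """Union-find over vertices 0..n-1 with path halving."""

    def __init__(self, n):
        self.parent = list(range(n))

    def find(self, v):
        # Path halving: a simple form of tree compression.
        while self.parent[v] != v:
            self.parent[v] = self.parent[self.parent[v]]
            v = self.parent[v]
        return v

    def link(self, u, v):
        """Insert edge (u, v); return True if it joins two components."""
        ru, rv = self.find(u), self.find(v)
        if ru == rv:
            return False
        # Link the larger root under the smaller one (deterministic, illustrative rule).
        if ru < rv:
            self.parent[rv] = ru
        else:
            self.parent[ru] = rv
        return True

    def connected(self, u, v):
        return self.find(u) == self.find(v)


def spanning_forest(n, edges):
    """Return the edges that joined two components, plus the final structure."""
    cc = Connectivity(n)
    forest = [e for e in edges if cc.link(*e)]
    return forest, cc
```

The edges for which `link` returns True form a spanning forest, and later calls to `link` extend the same structure as new edges arrive.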
Gunrock: A Programming Model and Implementation for Graph Analytics on Graphics Processing Units
The modern Graphics Processing Unit (GPU) is high-performance, highly parallel, and fully programmable; its high memory bandwidth, computing power, excellent peak throughput, and energy efficiency accelerate regular applications that have extensive data parallelism, regular memory access patterns, and modest synchronization. However, for graph analytics, the inherent irregularity of graph data structures leads to irregularity in data access and control flow, making efficient graph analytics on GPUs a significant challenge. Despite some promising specialized GPU graph algorithm implementations, parallel graph analytics on the GPU in general still faces two major challenges. The first is the programmability gap between low-level implementations of specific graph primitives and a general graph processing system. Programming graph algorithms on GPUs is difficult even for the most skilled programmers. Specialized GPU graph algorithm implementations do not generalize well since they often couple a specific graph computation to a specific type of parallel graph operation. The second is the lack of a GPU-specific graph processing programming model. High-level GPU programming models for graph analytics often recapitulate CPU programming models and do not compare favorably in performance with specialized implementations due to different kinds of overhead introduced by maintaining a high-level framework. This dissertation seeks to resolve the conflict between programmability and performance for graph analytics on the GPU by designing a GPU-specific graph processing programming model and building a graph analytics system on the GPU that not only allows quick prototyping of new graph primitives but also delivers the performance of customized, complex GPU hardwired graph primitives.
To achieve this goal, we present a novel data-centric abstraction for graph operations that allows programmers to develop graph primitives at a high level of abstraction while simultaneously delivering high performance by incorporating several profitable optimizations, which previously were only applied to different individual graph algorithm implementations on the GPU, into the core of our implementation, including kernel fusion, push-pull traversal, idempotent traversal, priority queues, and various workload mapping strategies. We design and implement a new graph analytics system, Gunrock, which contains a set of simple and flexible graph operation APIs that can express a wide range of graph primitives at a high level of abstraction. Using Gunrock, we implement a large set of graph primitives, which span from traversal-based algorithms and ranking algorithms, to triangle counting and bipartite-graph-based algorithms. All of our graph primitives achieve comparable performance to their hardwired counterparts and significantly outperform previous programmable GPU abstractions.